NSF PAR Search | NSF Public Access Repository

FIDESlib: fully-fledged open-source FHE library for efficient CKKS on GPUs

Agullo-Domingo, Carlos; Vera-Lopez, Oscar; Guzelhan, Seyda; Daksha, Lohit; El Jerari, Aymane; Shivdikar, Kaustubh; Agrawal, Rashmi; Kaeli, David; Joshi, Ajay; Abellan Miguel, Jose Luis (July 2025, IEEE International Symposium on Performance Analysis of Software and Systems)

Word-wise Fully Homomorphic Encryption (FHE) schemes, such as CKKS, are gaining significant traction due to their ability to provide post-quantum-resistant, privacy-preserving approximate computing, an especially desirable feature in the Machine-Learning-as-a-Service (MLaaS) paradigm. In this work, we introduce FIDESlib, the first open-source server-side CKKS GPU library that is fully interoperable with well-established client-side OpenFHE operations. Unlike other existing open-source GPU libraries, FIDESlib provides the first implementation featuring heavily optimized GPU kernels for all CKKS primitives, including bootstrapping. Our library also integrates robust benchmarking and testing, ensuring it remains adaptable to further optimization. Comparing our scheme against Phantom (previously the top open-source CKKS library), we show that FIDESlib offers superior performance and scalability. For bootstrapping, FIDESlib achieves at least a 70× speedup over the AVX-optimized OpenFHE implementation. FIDESlib is available on GitHub.
Free, publicly-accessible full text available July 7, 2026
PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM

https://doi.org/10.1109/HPCA61900.2025.00116

Son, Hyojun; Jonatan, Gilbert; Wu, Xiangyu; Cho, Haeyoon; Shivdikar, Kaustubh; Abellán, José L; Joshi, Ajay; Kaeli, David; Kim, John (March 2025, Proceedings)

Processing-in-memory (PIM), where compute is moved closer to memory or data, has been explored to accelerate emerging workloads. Different PIM-based systems have been announced, each offering a unique microarchitectural organization of their compute units, ranging from fixed functional units to programmable general-purpose compute cores near memory. However, one fundamental limitation of PIM is that each compute unit can only access its local memory; access to “remote” memory must occur through the host CPU, potentially limiting application performance scalability. In this work, we first characterize the scalability of real PIM architectures using the UPMEM PIM system. We analyze how the overhead of communicating through the host (instead of providing direct communication between the PIM compute units) can become a bottleneck for collective communications that are commonly used in many workloads. To overcome this inter-PIM-bank communication bottleneck, we propose PIMnet, a PIM interconnection network for PIM banks that provides direct connectivity between compute units and removes the overhead of communicating through the host. PIMnet exploits bandwidth parallelism, where communication across the different PIM banks/chips can occur in parallel to maximize communication performance. PIMnet also matches the DRAM packaging hierarchy with a multi-tier network architecture. Unlike traditional interconnection networks, PIMnet is a PIM-controlled network where communication is managed by the PIM logic, optimizing collective communications and minimizing the hardware overhead of PIMnet. Our evaluation of PIMnet shows that it provides up to 85× speedup on collective communications and achieves an 11.8× improvement on real applications compared to the baseline PIM.
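The contrast this abstract draws between host-mediated and direct collective communication can be illustrated with a toy simulation. The sketch below is not PIMnet's design or algorithm: it swaps in recursive doubling, a standard textbook AllReduce, as a stand-in for "direct" communication, and the function names and transfer counts are assumptions made purely for illustration, not measurements.

```python
def allreduce_via_host(node_values):
    """Host-mediated AllReduce: the host gathers every value, reduces,
    and writes the result back to every node (hypothetical model)."""
    total = sum(node_values)
    # n gathers into the host plus n write-backs, all passing through the host.
    host_transfers = 2 * len(node_values)
    return [total] * len(node_values), host_transfers


def allreduce_recursive_doubling(node_values):
    """Direct AllReduce via recursive doubling: in each round, node i
    exchanges its partial sum with node i XOR step. Needs log2(n) rounds."""
    n = len(node_values)
    if n & (n - 1):
        raise ValueError("this sketch requires a power-of-two node count")
    values = list(node_values)
    rounds = 0
    step = 1
    while step < n:
        # All pairwise exchanges in a round are independent of each other,
        # so in principle they can proceed in parallel.
        values = [values[i] + values[i ^ step] for i in range(n)]
        step *= 2
        rounds += 1
    return values, rounds


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(allreduce_via_host(data))
    print(allreduce_recursive_doubling(data))
```

Both functions leave every node holding the same sum. The difference is that the first funnels 2n transfers through a single point, while the second needs only log2(n) rounds of pairwise exchanges, which is the kind of parallelism a direct interconnect could exploit.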
Free, publicly-accessible full text available March 1, 2026
DEFCON: Deformable Convolutions Leveraging Interval Search and GPU Texture Hardware

https://doi.org/10.1109/IPDPS57955.2024.00063

Jayaweera, Malith; Li, Yanyu; Wang, Yanzhi; Ren, Bin; Kaeli, David (May 2024, IEEE)

Full Text Available
Multipath Mitigation via Clustering for Position Estimation Refinement in Urban Environments

https://doi.org/10.33012/2024.19605

Gutierrez, Julian; Gilabert, Russell; Dill, Evan; Hernandez, Guillermo; Kaeli, David; Closas, Pau (May 2024, ION)

Full Text Available
Energy-Aware Tile Size Selection for Affine Programs on GPUs

https://doi.org/10.1109/CGO57630.2024.10444795

Jayaweera, Malith; Kong, Martin; Wang, Yanzhi; Kaeli, David (March 2024, IEEE)
Digital Avatars: Framework Development and Their Evaluation

https://doi.org/10.24963/ijcai.2024/1031

Rupprecht, Timothy; Chang, Sung-En; Wu, Yushu; Lu, Lei; Nan, Enfu; Li, Chih-hsiang; Lai, Caiyue; Li, Zhimin; Hu, Zhijun; He, Yumei; et al (August 2024, International Joint Conferences on Artificial Intelligence Organization)

We present a novel prompting strategy for artificial intelligence driven digital avatars. To better quantify how our prompting strategy affects anthropomorphic features like humor, authenticity, and favorability, we present Crowd Vote, an adaptation of Crowd Score that allows judges to elect a large language model (LLM) candidate over competitors answering the same or similar prompts. To visualize the responses of our LLM and the effectiveness of our prompting strategy, we propose an end-to-end framework for creating high-fidelity artificial intelligence (AI) driven digital avatars. This pipeline effectively captures an individual's essence for interaction, and our streaming algorithm delivers a high-quality digital avatar with real-time audio-video streaming from server to mobile device. Both our visualization tool and our Crowd Vote metrics demonstrate that our AI driven digital avatars have state-of-the-art humor, authenticity, and favorability, outperforming all competitors and baselines. In the case of our Donald Trump and Joe Biden avatars, their authenticity and favorability are rated higher than even their real-world equivalents.
Full Text Available
MaxK-GNN: Extremely Fast GPU Kernel Design for Accelerating Graph Neural Networks Training

https://doi.org/10.1145/3620665.3640426

Peng, Hongwu; Xie, Xi; Shivdikar, Kaustubh; Hasan, Md Amit; Zhao, Jiahui; Huang, Shaoyi; Khan, Omer; Kaeli, David; Ding, Caiwen (April 2024, ACM)

Full Text Available
Scalability Limitations of Processing-in-Memory using Real System Evaluations

https://doi.org/10.1145/3639046

Jonatan, Gilbert; Cho, Haeyoon; Son, Hyojun; Wu, Xiangyu; Livesay, Neal; Mora, Evelio; Shivdikar, Kaustubh; Abellán, José L; Joshi, Ajay; Kaeli, David; et al (February 2024, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many "PIM nodes", or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns: AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel, which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance.
To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
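To make concrete why the NTT kernel mentioned above implies All-to-All communication, here is a minimal Python sketch of a naive O(n²) NTT. It is an illustration only and not the paper's implementation: the tiny modulus 17, the length 4, the root 4, and all function names are assumptions chosen for readability, and the product shown is a plain cyclic convolution rather than the negacyclic variant typically used in FHE schemes.

```python
MOD = 17   # small prime, illustrative only (real FHE moduli are far larger)
N = 4      # transform length
ROOT = 4   # primitive N-th root of unity mod 17: 4**2 % 17 == 16, 4**4 % 17 == 1


def ntt(values, root=ROOT, mod=MOD):
    """Naive O(n^2) number theoretic transform.
    Output i is a weighted sum over *every* input j, so if the inputs are
    partitioned across nodes, each node needs data from all the others."""
    n = len(values)
    return [
        sum(values[j] * pow(root, i * j, mod) for j in range(n)) % mod
        for i in range(n)
    ]


def intt(values, root=ROOT, mod=MOD):
    """Inverse transform: use the inverse root and scale by n^-1 mod p."""
    n_inv = pow(len(values), -1, mod)   # modular inverse, Python 3.8+
    root_inv = pow(root, -1, mod)
    return [(n_inv * x) % mod for x in ntt(values, root_inv, mod)]


def cyclic_multiply(a, b, mod=MOD):
    """Multiply two polynomials modulo x^N - 1 via pointwise products
    in the NTT domain."""
    fa, fb = ntt(a), ntt(b)
    return intt([(x * y) % mod for x, y in zip(fa, fb)])


def cyclic_multiply_direct(a, b, mod=MOD):
    """Schoolbook cyclic convolution, used only to cross-check the result."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % mod
    return out


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [5, 6, 7, 8]
    print(cyclic_multiply(a, b) == cyclic_multiply_direct(a, b))
```

Production implementations use O(n log n) butterfly networks instead of this quadratic form, but the data dependency is the same: every output coefficient depends on every input coefficient.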
Full Text Available
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

https://doi.org/10.1145/3613424.3614279

Shivdikar, Kaustubh; Bao, Yuhui; Agrawal, Rashmi; Shen, Michael; Jonatan, Gilbert; Mora, Evelio; Ingare, Alexander; Livesay, Neal; Abellán, José L; Kim, John; et al (October 2023, ACM)

Full Text Available
Accelerating Polynomial Multiplication for Homomorphic Encryption on GPUs

https://doi.org/10.1109/SEED55351.2022.00013

Shivdikar, Kaustubh; Jonatan, Gilbert; Mora, Evelio; Livesay, Neal; Agrawal, Rashmi; Joshi, Ajay; Abellan, Jose L.; Kim, John; Kaeli, David (September 2022, 2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED))

Full Text Available
